Knowledge Base Population using Semantic Label Propagation
Authors
Abstract
A crucial aspect of a knowledge base population system that extracts new facts from text corpora is the generation of training data for its relation extractors. In this paper, we present a method that maximizes the effectiveness of newly trained relation extractors at a minimal annotation cost. Manual labeling can be significantly reduced by Distant Supervision (DS), a method that constructs training data automatically by aligning a large text corpus with an existing knowledge base of known facts. For example, all sentences mentioning both ‘Barack Obama’ and ‘US’ may serve as positive training instances for the relation born_in(subject, object). However, distant supervision typically results in a highly noisy training set: many training sentences containing the known entity pairs do not really express the intended relation. We explore the idea of combining DS with (partial) human supervision to eliminate that noise. This idea is not novel per se, but our key contributions are: (i) a novel method of filtering the DS training set based on labeling Shortest Dependency Paths (SDPs), and (ii) the Semantic Label Propagation (SLP) model. We propose to combine DS with minimal human supervision by annotating features (in particular SDPs) rather than (potential) relation instances. Such so-called feature labeling is adopted to eliminate noise from the large and noisy initial training set, resulting in a significant increase of precision (at the expense of recall). We further improve on this approach by introducing the Semantic Label Propagation (SLP) method, which uses the similarity between low-dimensional representations of candidate training instances to extend the (filtered) training set in order to increase recall while maintaining high precision. Our proposed strategy for generating training data is studied and evaluated on an established test collection for knowledge base population (KBP), taken from the TAC KBP English slot filling task.
The experimental results show that the SLP strategy leads to substantial performance gains when compared to existing approaches, while requiring an almost negligible human annotation effort.
Corresponding author: Lucas Sterckx. Preprint submitted to Knowledge Based Systems, March 4, 2016. arXiv:1511.06219v2 [cs.CL], 3 Mar 2016.
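The three stages named in the abstract (distant supervision, filtering by labeled SDPs, and similarity-based propagation) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data, the hand-written SDP strings, the two-dimensional vectors, the cosine measure, the 0.95 threshold, and all function names are assumptions chosen only for illustration.

```python
import math

# Toy knowledge base of known facts: (subject, object) -> relation.
KB = {("Barack Obama", "US"): "born_in"}

# Toy candidates. Each has an entity pair, a hand-written stand-in for its
# shortest dependency path (SDP), and a made-up low-dimensional vector.
CANDIDATES = [
    {"pair": ("Barack Obama", "US"), "sdp": "was born in", "vec": [0.9, 0.1]},
    {"pair": ("Barack Obama", "US"), "sdp": "president of", "vec": [0.1, 0.9]},
    {"pair": ("Barack Obama", "US"), "sdp": "is a native of", "vec": [0.8, 0.2]},
]

# Hypothetical feature labels: an annotator judged these SDPs, not sentences.
LABELED_SDPS = {"was born in": True, "president of": False}


def distant_supervision(candidates, kb):
    """Mark every candidate whose entity pair is in the KB as a noisy positive."""
    return [dict(c, relation=kb[c["pair"]]) for c in candidates if c["pair"] in kb]


def filter_by_sdp(instances, labeled_sdps):
    """Keep only instances whose SDP was explicitly labeled as correct."""
    return [i for i in instances if labeled_sdps.get(i["sdp"]) is True]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate(filtered, noisy, labeled_sdps, threshold=0.95):
    """Re-add unlabeled instances that lie close to a trusted instance."""
    extended = list(filtered)
    for inst in noisy:
        if inst["sdp"] in labeled_sdps:
            continue  # the annotator already decided this SDP
        if any(cosine(inst["vec"], t["vec"]) >= threshold for t in filtered):
            extended.append(inst)
    return extended


noisy = distant_supervision(CANDIDATES, KB)          # high recall, noisy
filtered = filter_by_sdp(noisy, LABELED_SDPS)        # high precision, low recall
extended = propagate(filtered, noisy, LABELED_SDPS)  # recall partly recovered
```

In this toy run the filter keeps only the "was born in" instance, and propagation restores the unlabeled "is a native of" instance because its vector is close to a trusted one, while the instance labeled as incorrect stays out.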
Similar resources
Building a Domain Knowledge Base from Wikipedia: a Semi-supervised Approach
Knowledge bases are becoming indispensable to software engineering and knowledge engineering. However, the existing domain knowledge bases are always artificially constructed and small-scale. In this paper, we propose a semi-supervised approach to domain concepts detection and software engineering knowledge base construction from Wikipedia. First, the approach selects domain relevant tags from ...
Query Architecture Expansion in Web Using Fuzzy Multi Domain Ontology
Due to the growth of the web, there are many challenges in establishing a general framework for data mining and retrieving structured data from the Web. Creating an ontology is a step towards solving this problem. The ontology raises the main entity and the concept of any data in data mining. In this paper, we tried to propose a method for applying the "meaning" of the search system, but the problem ...
Paris: a Parallel Inference System
This paper presents an inferential system based on abductive interpretation of text. Inference to the best explanation is performed by the recognition of the most economic semantic paths produced by the propagation of markers on a very large linguistic knowledge base. The propagation of markers is controlled by their intrinsic propagation rules, devised from plausible semantic relation chains. ...
Reasoning with data flows and policy propagation rules
Data-oriented systems and applications are at the centre of current developments of the World Wide Web. In these scenarios, assessing what policies propagate from the licenses of data sources to the output of a given data-intensive system is an important problem. Both policies and data flows can be described with Semantic Web languages. Although it is possible to define Policy Propagation Rules...
A Knowledge-based System Integrating Speech and Image Understanding – Manual Version 1.0 –
We present a system that integrates speech and image understanding in the domain of a construction task. Knowledge needed to interpret speech and images is stored in one homogeneous knowledge base using the semantic network formalism ERNEST. It provides a means for constraint propagation and therefore – using a uniform representation of the information gained from real input data – the speech a...
Journal title:
- Knowl.-Based Syst.
Volume 108, Issue
Pages -
Publication date 2016